What is Prosper?
Prosper is a peer-to-peer lending marketplace. Similar to Kickstarter, where companies crowdsource money for projects, Prosper allows borrowers to crowdsource their loan from wealthier lenders, who get repaid directly with interest on the chunk of loan provided.
What is the Prosper Loan Dataset?
The dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.
## 'data.frame': 113937 obs. of 81 variables:
## $ ListingKey : Factor w/ 113066 levels "00003546482094282EF90E5",..: 7180 7193 6647 6669 6686 6689 6699 6706 6687 6687 ...
## $ ListingNumber : int 193129 1209647 81716 658116 909464 1074836 750899 768193 1023355 1023355 ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ ClosedDate : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1138 1 1263 1 1 1 1 1 1 1 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ LenderYield : num 0.138 0.082 0.24 0.0874 0.1985 ...
## $ EstimatedEffectiveYield : num NA 0.0796 NA 0.0849 0.1832 ...
## $ EstimatedLoss : num NA 0.0249 NA 0.0249 0.0925 ...
## $ EstimatedReturn : num NA 0.0547 NA 0.06 0.0907 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CurrentlyInGroup : Factor w/ 2 levels "False","True": 2 1 2 1 1 1 1 1 1 1 ...
## $ GroupKey : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 335 1 1 1 1 1 1 1 ...
## $ DateCreditPulled : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 14347 111883 6446 64724 85857 100382 72500 73937 97888 97888 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ OpenRevolvingMonthlyPayment : num 24 389 0 115 220 1410 214 101 219 219 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ TotalInquiries : num 3 5 1 1 9 2 0 16 6 6 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ PublicRecordsLast12Months : int 0 0 NA 0 0 0 0 0 0 0 ...
## $ RevolvingCreditBalance : num 0 3989 NA 1444 6193 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage. : num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanKey : Factor w/ 113066 levels "00003683605746079487FF7",..: 100337 69837 46303 70776 71387 86505 91250 5425 908 908 ...
## $ TotalProsperLoans : int NA NA NA NA 1 NA NA NA NA NA ...
## $ TotalProsperPaymentsBilled : int NA NA NA NA 11 NA NA NA NA NA ...
## $ OnTimeProsperPayments : int NA NA NA NA 11 NA NA NA NA NA ...
## $ ProsperPaymentsLessThanOneMonthLate: int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPaymentsOneMonthPlusLate : int NA NA NA NA 0 NA NA NA NA NA ...
## $ ProsperPrincipalBorrowed : num NA NA NA NA 11000 NA NA NA NA NA ...
## $ ProsperPrincipalOutstanding : num NA NA NA NA 9948 ...
## $ ScorexChangeAtTimeOfListing : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanCurrentDaysDelinquent : int 0 0 0 0 0 0 0 0 0 0 ...
## $ LoanFirstDefaultedCycleNumber : int NA NA NA NA NA NA NA NA NA NA ...
## $ LoanMonthsSinceOrigination : int 78 0 86 16 6 3 11 10 3 3 ...
## $ LoanNumber : int 19141 134815 6466 77296 102670 123257 88353 90051 121268 121268 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ LoanOriginationDate : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 426 1866 260 1535 1757 1821 1649 1666 1813 1813 ...
## $ LoanOriginationQuarter : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 18 8 2 32 24 33 16 16 33 33 ...
## $ MemberKey : Factor w/ 90831 levels "00003397697413387CAF966",..: 11071 10302 33781 54939 19465 48037 60448 40951 26129 26129 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ LP_CustomerPayments : num 11396 0 4187 5143 2820 ...
## $ LP_CustomerPrincipalPayments : num 9425 0 3001 4091 1563 ...
## $ LP_InterestandFees : num 1971 0 1186 1052 1257 ...
## $ LP_ServiceFees : num -133.2 0 -24.2 -108 -60.3 ...
## $ LP_CollectionFees : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_GrossPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NetPrincipalLoss : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LP_NonPrincipalRecoverypayments : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PercentFunded : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Recommendations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ InvestmentFromFriendsAmount : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
Based on an initial scan of the data and its types, a few things pop up. The first is the data types of the columns: several are factors, while others are integers or doubles. Furthermore, going through the different variables present in the data, a few stand out as interesting socioeconomic factors to investigate against the loan terms provided. Occupation, EmploymentStatus, IsBorrowerHomeowner, and StatedMonthlyIncome are some examples of variables that I would like to dig into to see how they affect LoanStatus, LoanOriginalAmount, ProsperScore, BorrowerAPR, Term, etc.
The variable documentation notes that variables such as CreditGrade, ProsperRating, and ProsperScore are only available before/after certain times. CreditGrade is available before 2009, and ProsperScore and ProsperRating are available after July 2009. To decide what impact this has and which variable I should consider, I wanted to know how many listings were created before and after July 2009.
## [1] 0.7447361
To compare the ListingCreationDate with a fixed date (July 2009), I needed to convert the factors into characters and then compare the two. Dividing the number of listings created after July 2009 by the total number of listings shows that over 74% of the listings were created after July 1, 2009. As a result, I will consider the ProsperRating and ProsperScore variables over the CreditGrade variable.
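A minimal sketch of that comparison, using a four-row toy sample in place of the real file. The data frame name `loans` is a placeholder, and parsing the first ten characters into a `Date` is my choice here for a locale-independent comparison; the original chunk may have compared the strings directly.

```r
# Toy stand-in for the real data; `loans` is a hypothetical name.
loans <- data.frame(
  ListingCreationDate = factor(c(
    "2007-08-26 19:09:29.263000000",
    "2014-02-27 08:28:07.900000000",
    "2009-06-30 23:59:59.000000000",
    "2012-10-22 11:02:35.010000000"
  ))
)

# Factors store integer codes, so convert to character first.
created <- as.character(loans$ListingCreationDate)

# Keep only the "YYYY-MM-DD" part and parse it as a Date.
created_date <- as.Date(substr(created, 1, 10))

# Share of listings created on or after July 1, 2009.
share_after <- mean(created_date >= as.Date("2009-07-01"))
print(share_after)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is the same as dividing the count by the total number of listings.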
The Prosper Rating (numeric) ranges from 0 - 7 with the following correspondence: 0 - N/A, 1 - HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. The ratings follow a bell curve pattern.
I combined the factors that were Past Due into one group to get a cleaner histogram and a better sense of the status of the loans. Most loans are Current, followed by Completed loans, and then Chargedoff loans. While there are relatively few loans in the Past Due bucket, there are a lot that are Chargedoff, i.e., the loan is over 120 days past due, and several in the Default category.
Based on Prosper’s current literature, it is unclear what the difference between a Chargedoff and a Defaulted loan is. However, a 2008 description of its API service listed defaulted loans under Delinquency, Bankruptcy, or Deceased. Therefore, I will assume that they are loans that are written off for identified reasons, and I will combine the Chargedoff and Defaulted loans if analyzing them further.
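The grouping described above could be sketched as follows. The exact level names (for example "Past Due (1-15 days)") are assumed from memory of the dataset rather than shown in the output above, and the vector names are placeholders.

```r
# Toy sample of LoanStatus values; level names are assumptions.
status <- factor(c(
  "Current", "Completed", "Chargedoff", "Defaulted",
  "Past Due (1-15 days)", "Past Due (31-60 days)",
  "Past Due (>120 days)", "Current"
))

# Collapse every "Past Due (...)" level into a single "Past Due" bucket.
grouped <- as.character(status)
grouped[grepl("^Past Due", grouped)] <- "Past Due"

# Optional second step: treat Chargedoff and Defaulted as one group.
written_off <- grouped
written_off[written_off %in% c("Chargedoff", "Defaulted")] <- "Chargedoff/Defaulted"

status_counts <- table(grouped)
print(status_counts)
print(table(written_off))
```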
Most people get loans on Prosper to consolidate their debt.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
The most popular Loan Amount is $4000, followed by $15,000 and then $10,000. With the exception of these values and other round amounts (e.g. $20,000), fewer loans are made as the loan value increases.
There are 3 terms that the loans fall into: 12 months, 36 months, and 60 months. The 36-month term is the most popular, followed by the 60-month term and finally the 12-month term.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15629 0.20976 0.21883 0.28381 0.51229 25
Borrower’s APR (Annual Percentage Rate), the amount of interest owed including fees, follows a roughly normal distribution, with the exception of the 36% rate being the most popular even though it’s close to the tail.
Socioeconomic Factors -
Most people selected the ‘Other’ category as their occupation, either because they did not fit within the occupations specified or they did not want to share their occupational status. The next 5 most popular professions are Professional, Computer Programmer, Executive, Teacher, and Analyst.
Most people who list loans are Employed. However, there are other categories, such as Full-Time, Part-Time, and Self-employed, which also relate to being employed. It would be more interesting to bucket these together and then compare the employed to the unemployed. It would be ideal to get finer detail on the status of those in the Employed category, i.e., whether those individuals are full-time, part-time, or self-employed, but unfortunately we don’t have more granular data for that group.
## [1] 18.87021
The data above indicates that for the loan listings in the dataframe, there are ~19 times more people classified as Employed than as ‘Not Available’, the next most popular category.
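As a rough sketch of how such a ratio can be computed, the snippet below buckets the employment-related categories together and divides the largest count by the second largest. The level names and the toy counts are assumptions for illustration, not the real data.

```r
# Toy EmploymentStatus values; level names are assumptions.
status <- c(
  rep("Employed", 6), rep("Full-time", 3), "Part-time",
  rep("Self-employed", 2), rep("Not available", 3), "Retired"
)

# Categories that all describe some form of employment.
employed_levels <- c("Employed", "Full-time", "Part-time", "Self-employed")

# Relabel them as a single "Employed" bucket; leave the rest untouched.
bucket <- ifelse(status %in% employed_levels, "Employed", status)

# Count per bucket, largest first.
counts <- sort(table(bucket), decreasing = TRUE)

# How many times larger is the top bucket than the runner-up?
ratio <- counts[[1]] / counts[[2]]
print(counts)
print(ratio)
```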
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750003
The histogram of stated monthly income has a bell curve shape with its peak shifted towards the left and a long right tail. The median stated monthly income is $4,667, with a first quartile of $3,200 and a third quartile of $6,825 (the mean is $5,608). The graph does not show the top 1 percent of stated monthly incomes because, with a maximum of $1,750,003, they heavily skew the graph.
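Trimming the top 1 percent before plotting can be done with `quantile()`. This is a small sketch on made-up incomes; the variable names are placeholders.

```r
# Toy stated monthly incomes with one extreme value.
income <- c(0, 1500, 3200, 4667, 5000, 6825, 9000, 1750003)

# Value below which 99% of observations fall.
cutoff <- quantile(income, 0.99, na.rm = TRUE)

# Keep only observations at or below that cutoff.
trimmed <- income[!is.na(income) & income <= cutoff]

print(cutoff)
print(summary(trimmed))
```

The trimmed vector can then be passed to `hist()` or a plotting library so that the extreme values no longer compress the rest of the distribution.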
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
As CreditScoreRangeLower and CreditScoreRangeUpper both contain the same information under different names, I only need to look at one of them and chose the CreditScoreRangeLower variable as it has round numbers. The distribution of the credit scores approximates a bell curve pushed to the far right of the credit score range. Most people have credit scores in the 680 bucket. The median is also 680, and the mean is close at 685.6.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
The debt to income ratio follows a bell curve skewed to the left. This can properly be seen when removing outliers. The graph above shows the distribution when the top 1 percent of the data has been removed. The median debt to income ratio is 0.22, and it is bounded by 0.14 and 0.32 in the first and third quartiles respectively.
The number of loans taken out is roughly similar across months. There is a small dip in April, while January, October, and December see the highest demand for loans.
The number of loans taken out through Prosper over the years follows some interesting patterns. From 2005 to 2008 the number of loans was slowly increasing, until 2009, when numbers dropped close to 2005 levels. This is likely a result of the 2008 recession. However, from 2009 to 2013, Prosper saw great growth in the number of loans taken through its platform, with the number of loans in 2013 almost double that of 2012.
Based on the trend, one would expect 2014 to have a much higher number of loans than 2013. Instead, it has one of the lowest. To investigate this, I looked into the highest month in 2014.
## [1] 3
March (3) is the last month in 2014 with any recorded Prosper loan data. As a result it makes sense that the number of loans in 2014 is so much lower than in 2013.
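A minimal sketch of that check, assuming the year and month are derived from LoanOriginationDate (the original may have used a different date column). The toy dates and variable names are placeholders.

```r
# Toy LoanOriginationDate values stored as a factor, as in the dataset.
origination <- factor(c(
  "2013-11-05 00:00:00", "2014-01-15 00:00:00",
  "2014-03-10 00:00:00", "2012-07-01 00:00:00"
))

# Parse the date part and extract year and month as integers.
dates <- as.Date(substr(as.character(origination), 1, 10))
loan_year <- as.integer(format(dates, "%Y"))
loan_month <- as.integer(format(dates, "%m"))

# Latest month that has any loan in 2014.
last_month_2014 <- max(loan_month[loan_year == 2014])
print(last_month_2014)
```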
The dataset consists of 113,937 loans (rows) with 81 variables (columns). The columns consist of data with multiple datatypes such as factors, integers, and numeric (decimals).
As mentioned, I’m interested in exploring how the socioeconomic factors affect the loan terms. Occupation, EmploymentStatus, IsBorrowerHomeowner, DebtToIncomeRatio and StatedMonthlyIncome are some examples of variables that I would like to dig into to see how they affect LoanStatus, LoanOriginalAmount, ProsperScore, BorrowerAPR, Term, and ListingCategory.
It will be interesting to examine some of the variables, such as LoanOriginalAmount against time.
For several factor columns, such as EmploymentStatus and LoanStatus, I created new columns to group some of the factors together to get a better understanding of the distribution of data.
In several cases I reordered the data or removed outliers to display more relevant sections of the graph. In the case of Occupation, I reordered the factors such that the histogram would display the occupations in descending order of popularity. In the cases of StatedMonthlyIncome and DebtToIncomeRatio I removed the top 1 percentile of data, which were distorting the graphs as outliers.
Based on the graph it doesn’t look like the loan amount affects the Borrower APR that much. It should be noted that higher loans have a smaller range of Borrower APR values at the lower end of the spectrum, but this is only for loans above the $20,000 mark.
Similarly, the relationship between the original loan amount and the debt to income ratio isn’t apparent. In both graphs we can see that people tend to take out larger loans at fixed values, for example, $10,000, $15,000, etc. The graph shows debt to income ratios up to the 99th percentile.
Looking at the distribution of the original loan amount against the stated monthly income, we can see a rough linear relationship between the two. There is still a lot of variance in the values, but compared to the previous two graphs, this is the first where the original loan amount has shown any dependence on a variable. The graph depicts income values up to the 99th percentile.
When plotting the numeric Prosper rating against the original loan amount, we see that as the rating goes up, the median and maximum original loan amount increase. People with a Prosper rating of 7 (the maximum value) get loans ranging from $1,000 to $35,000, not just loans at the higher end: even though they have a great Prosper rating and can qualify for a larger loan, they may only need a small amount.
When plotting the numeric Prosper rating against borrower APR we can see a clear trend between a higher Prosper rating and a lower Borrower APR.
I want to dig into the factors affecting the original loan amount, borrower APR, and the Prosper rating a bit more to unearth any interesting facts and patterns.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 3.000 4.000 4.288 6.000 7.000 12630
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 3.00 4.00 3.83 5.00 7.00 16454
Digging into the Prosper rating a little more, I compared how being a homeowner affected the ratings. The heatmap shows the distribution of Prosper ratings for homeowners and non-homeowners. We can see that more homeowners have higher Prosper ratings than non-homeowners. The summary statistics show Prosper rating statistics for homeowners and non-homeowners respectively.
While the median ratings were the same for both homeowners and non-homeowners, the means and interquartile ranges differed, with homeowners having a higher mean and IQR as compared to non-homeowners.
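Per-group statistics like these can be produced with `tapply()`. The sketch below uses six made-up rows; the column names follow the `str()` output above, while the data frame name `loans` is a placeholder.

```r
# Toy data: homeowner flag and numeric Prosper rating (NA where unrated).
loans <- data.frame(
  IsBorrowerHomeowner = factor(c("True", "True", "True", "False", "False", "False")),
  ProsperRating..numeric. = c(6, 4, 3, NA, 5, 3)
)

rating <- loans$ProsperRating..numeric.
owner <- loans$IsBorrowerHomeowner

# Mean, median and interquartile range of the rating for each group.
mean_by_owner <- tapply(rating, owner, mean, na.rm = TRUE)
median_by_owner <- tapply(rating, owner, median, na.rm = TRUE)
iqr_by_owner <- tapply(rating, owner, IQR, na.rm = TRUE)

print(mean_by_owner)
print(median_by_owner)
print(iqr_by_owner)
```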
A visualization of the median loan amount by listing category shows that debt consolidation is the category with the highest median loans. The red line signifies the overall median loan amount ($6500), and only 5 listing categories have median loan amounts above it. These are debt consolidation, baby & adoption, wedding loans, business, and boat.
Most listing categories have median borrower APRs higher than the overall median, as seen in the graph above. The overall median Borrower APR is 20.98%.
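The comparison of per-category medians against the overall median can be sketched like this. The category labels and amounts are invented, and `Category` stands in for the listing category after its numeric codes have been mapped to names.

```r
# Toy data: listing category label and original loan amount.
loans <- data.frame(
  Category = c("Debt Consolidation", "Debt Consolidation",
               "Auto", "Auto", "Business", "Business"),
  LoanOriginalAmount = c(10000, 15000, 2000, 4000, 8000, 12000)
)

# Median loan amount within each category.
category_medians <- tapply(loans$LoanOriginalAmount, loans$Category, median)

# Overall median, i.e. the reference line in the plot.
overall_median <- median(loans$LoanOriginalAmount)

# Categories whose median sits above the overall median.
above <- names(category_medians)[category_medians > overall_median]

print(category_medians)
print(overall_median)
print(above)
```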
By occupation, Pharmacists and Judges take out the highest median loans, while college sophomores take out the lowest median loans.
Pharmacists take the lead once again with the highest median borrower APR, while Judges (the runner-up for median loan amounts) have the lowest median Borrower APR.
As seen in the 3rd scatterplot, when considering median loan amounts against income ranges, for the most part, the higher the income range, the higher the median loan taken. However, those with $0 incomes buck the trend and have higher median loans than those with $1-$24,999 incomes.
It’s a bit strange that someone can have no income and not be part of the ‘not employed’ group and still get a higher loan (on average) than people who are employed and make between $1-$24,999.
A similar trend holds for median borrower APR vs. Income Range, but in the opposite direction: those with higher income ranges get lower borrower APRs on average.
Once again people in the $0 income range buck the trend with one of the lowest median borrower APRs. Those that are not employed have the highest median Borrower APR.
Similar to original loan amounts, the median Prosper Rating (numeric) is higher for those in higher income range brackets.
When comparing median loan amounts to credit scores, there is a trend with people with higher credit scores having higher median loan amounts.
Similar to before, those with a credit score of 0 still receive loans of a similar median value as those with credit scores just below 500.
As with Income Ranges, the median borrower APR goes down with increasing credit scores.
When looking at the median loan amounts based on the status of the loan, we can see that the loans that were cancelled had the lowest median amount, while those that are current have the highest median amount, far exceeding the median across the entire group. As the current loans are likely the more recent ones, this suggests that people have been requesting and receiving much bigger loans more recently.
The Borrower APR is the highest for the past due and charged off loans. As we saw earlier, the numeric Prosper rating seems to be related to the Borrower APR: a lower rating results in a higher APR. It is not a surprise then that those with higher APRs are less likely to repay their loans.
Looking at the median value of loans taken out over time (months), we can see that the highest value loans are taken out at the bookends of the year, while lower value loans are taken out in the middle of the year. One potential reason could be the consolidation of credit card debt after the Christmas holidays.
When comparing the median loan value taken out over years, we can see a trend towards bigger loans being taken out over time on average. 2013 and 2014 are the only years where the median loan amount is higher than the overall median loan amounts, supporting the idea that the ‘Current’ loans with higher median values are more recent.
The majority of the bivariate analysis looked into how socioeconomic factors affected the original loan amount, borrower APR, and occasionally the Prosper rating. It was interesting to note how the Prosper rating affected both the median loan values and the APR the borrowers had to pay. The analysis also revealed that the loan amount and borrower APR were sensitive to the borrower’s income range and credit score.
Looking at the value of the loans over time, both in months and years, revealed some interesting ancillary information. The median loans taken out varied depending on the time of the year, with the December to February period seeing the highest value of loans. Since 2005 we can also see how the loans that people have taken out with Prosper have grown in median value, with 2013 and 2014 median loan amounts almost 2x that of 2012.
The most pronounced relationships appear to be how income range varies proportionally with credit score ratings, and how borrower APR varies inversely with credit score ratings.
The three factors of interest (original loan amount, borrower APR, and the numeric Prosper rating) are plotted against each other, with the different Prosper ratings represented with different colours on the chart. From the visualization we can see that there are clear ranges of borrower APR that are defined by Prosper rating values. This further demonstrates the strong relationship between borrower APR and the Prosper rating.
There doesn’t appear to be as much vertical distinction between the loan amounts and the Prosper rating. As mentioned before, this is because people with high Prosper ratings can still elect to take out smaller loans. One can see however, that as the Prosper rating goes down, the maximum value of the original loan amount decreases.
Based on this data it appears that the Prosper rating plays a role in the interest that a borrower has to pay as well as the maximum value of the loan they can take out. According to Prosper, the ratings are determined using their proprietary algorithm. I want to use the data provided to find factors that influence the rating, and as a result, help determine whether an individual is more likely to get approved for a higher loan amount and lower borrower APR.
As stated monthly income and credit score had an impact on the loan amounts and borrower APR, they are plotted here to determine their impact on the Prosper rating. The credit score range has been limited to values over 600, which better illustrate the trend. Similarly, the monthly income axis shows values up to the 99th percentile of the data points.
We can see that lower credit scores and lower stated monthly incomes result in lower Prosper ratings, and vice versa for higher Prosper ratings.
When comparing the Debt to Income Ratio against the credit score, we see that lower debt to income ratios and higher credit scores result in higher Prosper ratings, and vice versa for lower Prosper ratings. The Debt to Income ratios displayed cover values up to the 99th percentile of the data.
When the Debt to Income ratio and the stated monthly income are compared we see a trend where lower debt to income ratios and higher monthly incomes result in higher Prosper ratings, and vice versa, albeit, the signal is not as strong as in the previous two graphs.
In the bivariate analysis we saw that being a homeowner impacted the mean value of the Prosper rating. I therefore wanted to see how this, in addition to credit score, the DTI ratio, and monthly income, affected the Prosper rating. Since the credit score was proportional to the Prosper rating, and the DTI ratio was inversely proportional to the Prosper rating, I created a new variable to represent the two together. CreditScore.DebtIncome is the credit score divided by the debt to income ratio, so that it relates positively to the Prosper rating.
Plotting CreditScore.DebtIncome against the stated monthly income and faceting by homeownership revealed some insights. For one, we can see more dark dots at the top right corner of the graph where borrowers are homeowners. This implies that being a homeowner does impact one’s chances of having a higher credit score. Those that don’t own a house also seem less likely to have a higher monthly income. Finally, we can confirm that the new variable, CreditScore.DebtIncome, increases with the Prosper rating for both homeowners and non-homeowners.
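A minimal sketch of how such a derived column might be built. Using CreditScoreRangeLower as the credit score and treating a debt to income ratio of zero as missing (to avoid infinite values) are both my assumptions; the original chunk may handle these cases differently.

```r
# Toy data; `loans` is a placeholder name.
loans <- data.frame(
  CreditScoreRangeLower = c(680, 720, 640, NA),
  DebtToIncomeRatio = c(0.17, 0.36, 0, 0.25)
)

# Assumption: a ratio of exactly 0 would give Inf, so mark it as missing.
dti <- loans$DebtToIncomeRatio
dti[!is.na(dti) & dti == 0] <- NA

# Credit score divided by debt to income ratio.
loans$CreditScore.DebtIncome <- loans$CreditScoreRangeLower / dti

print(loans)
```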
## [1] 0.2470109
Based on this trend I wanted to see how effective the credit score divided by the DTI ratio was as a metric to determine Prosper ratings. I decided to take the log base 10 of the CreditScore.DebtIncome variable to see how it matched up against the numeric Prosper ratings. Each dot represents an individual data point, while the red line represents the median log10(CreditScore.DebtIncome) value as a function of the Prosper rating. We can see that there is a linear, though somewhat flat, relationship between the variables.
Calculating the correlation between the variables resulted in a Pearson’s r value of 0.25, which is not very supportive of a relationship between the two variables. While this was a starting point I still need to look at more variables and their impact on the Prosper rating.
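The correlation and the median line can be computed along these lines. The numbers are a deliberately perfect toy example (so the correlation comes out as 1, unlike the real data), and the variable names are placeholders.

```r
# Toy values: numeric Prosper rating and the combined variable.
rating <- c(1, 2, 3, 4, 5, 6, NA)
combined <- c(10, 100, 1000, 10000, 100000, NA, 500)

# Log base 10 compresses the wide range of the combined variable.
log_combined <- log10(combined)

# Pearson's r, using only rows where both values are present.
r <- cor(log_combined, rating, use = "complete.obs", method = "pearson")

# Median of the logged variable at each rating (the red line in the plot).
median_by_rating <- tapply(log_combined, rating, median, na.rm = TRUE)

print(r)
print(median_by_rating)
```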
I kept CreditScore.DebtIncome on the y-axis as it still impacted Prosper ratings, but just didn’t tell the full story. I compared it against the employment status duration, i.e., how long someone has been working. I expected that those who had worked for longer would have higher Prosper ratings. The results were not as expected, with employment duration not playing a big role in the value of the Prosper rating.
I moved on to look at open credit lines next. From the graph we can see that people with higher open credit lines and higher CreditScore.DebtIncome values have a higher Prosper rating, and vice versa for those with lower Prosper ratings.
I next looked at current delinquencies. As expected, those with more delinquencies had lower Prosper ratings. However, the majority of people had 0 delinquencies, and their Prosper ratings varied. Though it seems to be a weaker predictor of the Prosper rating than CreditScore.DebtIncome, current delinquencies provides a means to determine whether a person should get a lower Prosper rating.
Looking at the graph above we can see that those with a low bankcard utilization are more likely to have a higher Prosper rating, and vice versa.
Plotting available bankcard credit against CreditScore.DebtIncome we can see that high CreditScore.DebtIncome values and high available bankcard credit resulted in higher Prosper ratings and vice versa.
## [1] 0.541038
Based on the outcomes of this additional exploratory analysis we have seen how a few other financial factors impact the Prosper rating. I used this data to adjust the original log10(credit score/debt to income ratio) formula to predict the Prosper rating. Initially I used all the variables explored to attempt to model the relationship between them and the Prosper rating. However, I quickly learnt that not all variables predicted the relationship strongly.
By tweaking the formula I was eventually left with 5 variables (credit score, available bankcard credit, stated monthly income, debt to income ratio, and current delinquencies) that resulted in a Pearson’s r of 0.54. These factors were combined through multiplication and division and then scaled to give the variable PRvariables, which I then took the log of to produce the graph above.
PRvariables ~ (credit score x available bankcard credit x stated monthly income) / (debt to income ratio x current delinquencies)
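A rough sketch of the combination above as a function. The scaling step is not specified in the text, so the `+ 1` on current delinquencies is purely my assumption to keep borrowers with zero delinquencies from causing a division by zero; the function and argument names are hypothetical.

```r
# Hypothetical helper combining the five variables.
# Assumption: delinquencies are shifted by 1 so that 0 does not divide by zero.
pr_variables <- function(credit_score, bankcard_credit, monthly_income,
                         dti, delinquencies) {
  (credit_score * bankcard_credit * monthly_income) /
    (dti * (delinquencies + 1))
}

# Two toy borrowers who differ only in current delinquencies.
clean <- pr_variables(700, 1000, 5000, 0.25, 0)
late <- pr_variables(700, 1000, 5000, 0.25, 1)

# The log keeps the very wide range of values on a readable scale.
print(log10(clean))
print(log10(late))
```

With this form, a higher credit score, more available credit, or more income raises the value, while a higher debt to income ratio or more delinquencies lowers it, which matches the direction of the relationships described above.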
The red line shows the median relationship between log10(PRvariables) and the Prosper rating. As seen on the graph, there is an almost linear, increasing trend between the two variables.
While this does not indicate strong correlation, the value is high enough to signal that I’m on the right path. I’m sure the factors are weighted in reality and that there are several other variables that influence the Prosper rating and that the combination of variables is more complicated than simply multiplying or dividing them.
Based on the multivariate analysis, it appears that from the variables I have explored, a combination of credit score, available bankcard credit, stated monthly income, debt to income ratio, and current delinquencies best explain the determination of the Prosper rating, which further impacts the maximum loan amounts and average borrower APR.
What I found surprising was that a higher number of open credit lines results in a better Prosper rating. Most of the other relationships were intuitive and it was interesting to learn how our financial behaviours can impact the type of loan we can get.
The first plot depicts the reasons that people get loans with Prosper. As we can see, people are overwhelmingly using Prosper for debt consolidation. Knowing that most people have the same objective sets the context for what people expect when they get a loan from Prosper. I expect that Prosper offers better interest rates for these individuals than their current financing options do.
The second plot looks at two important factors that define a loan: how much you borrowed, and the annual interest you have to pay on the loan. It then compares these factors against the Prosper Rating, which uses a proprietary algorithm to determine a borrower’s “loan-worthiness”. Higher ratings are better than lower ratings. I really like this visualisation because you can clearly see the distinctive Prosper ratings based on the Borrower APR. We can also see that the maximum loans that people get are a lot lower for those with ratings below 3.
The final graph shows the relationship of PRvariables, a variable that combines the factors that influence the Prosper rating, which in turn influences the maximum loan you can get approved for and the annual interest that you would pay on the loan. Based on the findings, the factors I investigated that most strongly impact your Prosper rating are your credit score, available credit, monthly income, debt to income ratio, and current delinquencies. As your credit score, available credit, and monthly income increase, your Prosper rating increases. As your debt to income ratio and current delinquencies decrease, your Prosper rating goes up. Using this information, people looking to refinance their debt and get the best interest rate can do a pre-assessment and make adjustments to their situation before applying for a loan.
The Prosper dataset contained information about 113,937 loans with 81 variables each. The data included information about the loan (amount, status, interest, term, category, rating), the borrowers (socioeconomic data), and the lenders (number of lenders, yield). Given the vast number of variables, I decided to focus on one aspect of the data to get more meaningful insights. I chose to look at how a borrower’s socioeconomic status impacts the type of loan they receive. In particular I focused on the original value of the loan and the borrower APR (annual percentage rate) for the terms of the loan. I picked the APR as a metric as it’s representative of the total costs that the borrower has to bear.
Initially I explored the loan data and socioeconomic data independently. This led to insights about what people use Prosper loans for, how much they typically borrow, and at what rate. I then started to compare these variables against each other to unearth any relationships between them. Through performing bivariate analyses I learnt that the value of the loan and the APR are influenced by the Prosper rating, the credit score, and the income range. Digging deeper and combining these variables and more using multivariate analysis, I was able to discover that the Prosper rating was the main factor that influenced the APR borrowers received. It’s a number generated by Prosper that determines the “loan-worthiness” of a borrower.
My next step was to figure out what factors influence the opaque Prosper rating. Given the plethora of variables to choose from, this step took time as I made chart after chart, trying to find variables that related to the Prosper rating in any way. Once I had narrowed down the variables that impacted the Prosper rating, I was curious to see if I could combine them in a way that would eventually result in a linear relationship between these combined variables and the Prosper rating. This resulted in an iterative process of testing variables, graphing them, and checking their correlation to the Prosper rating as a measure of which equations were ultimately better than the others. Eventually I settled on 5 variables that impacted the Prosper rating the most. These were the credit score, available credit, monthly income, debt to income ratio, and current delinquencies.
Though it was fulfilling to identify some of the factors that contribute to the Prosper rating, I realise that there are several more factors in different combinations that impact the Prosper rating. My conclusions are limited by the data available. An interesting extension of this work would be to compare the Prosper rating with the actual result of the loan. Given more recent data (from 2014 to 2017) we would have more paid/defaulted loans and be able to close the loop and see how well the Prosper ratings were able to predict the actual riskiness of the loan.